CW-NET for multitype cell detection and classification in bone marrow examination and mitotic figure examination

Abstract
Motivation: Bone marrow (BM) examination is one of the most important indicators in diagnosing hematologic disorders and is typically performed under the microscope via oil-immersion objective lens with a total 100× objective magnification. On the other hand, mitotic detection and identification is critical not only for accurate cancer diagnosis and grading but also for predicting therapy success and survival. Fully automated BM examination and mitotic figure examination from whole-slide images (WSIs) is highly demanded but challenging and poorly explored. First, the complexity and poor reproducibility of microscopic image examination are due to the cell type diversity, delicate intralineage discrepancy within the multitype cell maturation process, cells overlapping, lipid interference and stain variation. Second, manual annotation on whole-slide images is tedious, laborious and subject to intraobserver variability, which restricts the supervised information to a limited set of easily identifiable and scattered cells annotated by humans. Third, when the training data are sparsely labeled, many unlabeled objects of interest are wrongly defined as background, which severely confuses AI learners.
Results: This article presents an efficient and fully automatic CW-Net approach to address the three issues mentioned above and demonstrates its superior performance on both BM examination and mitotic figure examination. The experimental results demonstrate the robustness and generalizability of the proposed CW-Net on a large BM WSI dataset with 16 456 annotated cells of 19 BM cell types and a large-scale WSI dataset for mitotic figure assessment with 262 481 annotated cells of five cell types.
Availability and implementation: An online web-based system of the proposed method has been created for demonstration (see https://youtu.be/MRMR25Mls1A).


Introduction
Examination of bone marrow (BM) is crucial for the diagnosis and management of many disorders of the blood and BM (Lee et al. 2008). BM nucleated differential cell count (NDC) is compulsory to assess the hematopoiesis in different cell lineages and the proportion of aberrant cells. That is, BM NDC is an invaluable assessment that not only produces a correct diagnosis but also provides a significant indicator of prognosis and disease follow-up, particularly for hematological malignancies like acute myeloid leukemia (Greenberg et al. 2012), chronic myeloid leukemia (Swerdlow et al. 2017) and multiple myeloma (Kumar et al. 2016). Compared with general pathological examinations, which usually identify only one or a few types of tumor tissues for each analysis, BM NDC analysis is much more complicated and difficult as there are >16 types of cells to be detected and classified at once. In addition, for pathological diagnosis, pathologists may conduct a microscopic assessment on WSIs using computer-assisted systems (Campanella et al. 2019), but for BM examination, BM NDC analysis is generally conducted via oil-immersion objective lens with a total 1000× magnification, which makes fully automated analysis more challenging. Apart from the diversity of cell types, challenges include delicate intralineage discrepancy within the BM cell maturation process, cells overlapping, lipid interference and stain variation, causing large intra- and interobserver variability (Chandradevan et al. 2020). The enormous size of WSIs makes automated BM NDC analysis on WSIs more difficult. In addition, a previous study shows that existing modern hematology analyzers are poor in recognition and detection of blasts, immature granulocytes and basophils (Meintker et al. 2013). In practice, BM NDC analysis requires well-trained examiners to perform cytomorphological assessment intensively from low to high magnification (×10, ×40 up to ×100 objective magnification with oil immersion).
According to International Council for Standardization in Hematology guidelines (Lee et al. 2008), in order to generate percentages of the number of required cell types for diagnosis and disease, at least 500 cells should be analyzed on each smear and at least two smears are assessed for each patient. An accurate and reliable BM NDC analysis system is highly demanded in order to improve diagnostic precision, speed and reliability and to minimize valuable human resource costs. In this study, we build a large WSI dataset with 16 456 annotated cells of 19 BM cell types to evaluate the robustness and generalizability of the proposed model.
Early screening and diagnostic information will aid in lowering death rates and in better understanding the aggressiveness of cancer stages. The Nottingham Grading System (NGS) is commonly used to grade three major tumor features: tubule development, nuclear pleomorphism and mitotic rate (Balkenhol et al. 2019), and according to the NGS, the mitotic rate has the highest predictive value among the three (Balkenhol et al. 2019). Hence, mitotic detection and identification are critical not only for accurate cancer diagnosis and grading but also for predicting therapy success and survival (Bray et al. 2018). Pathologists often perform such mitotic identification tasks visually, which is time-consuming, subjective and poorly reproducible, with considerable inter- and intra-rater variability due to the difficulties in recognizing mitotic figures and their varied distribution across WSIs (Bertram et al. 2020). Previous studies have found 17.0% to 34.0% interrater disagreement in distinguishing individual mitotic figures from other cell features in canine cutaneous mast cell tumor and human breast cancer (Malon et al. 2012, Bertram et al. 2020). As a result, an automated computer-aided approach for mitotic figure examination is highly demanded. In recent years, there have been a number of international medical image analysis challenges in the field of automatic identification of mitotic figures, such as the MIDOG 2021 challenge (Aubreville et al. 2023), the TUPAC16 challenge (Veta et al. 2019), the ICPR MITOS-ATYPIA-2014 challenge (Roux 2014) and the ICPR MITOS-2012 challenge (Ludovic et al. 2013). However, the number of annotated mitotic figures in these datasets is small (fewer than one thousand annotated cells for each dataset). To evaluate the robustness and generalizability of the proposed method, we utilized a large-scale WSI dataset (Bertram et al. 2019) with 262 481 annotated cells of five cell types for mitotic figure assessment.
In this study, we present an effective, fully automatic and fast deep learning approach (CW-Net) for multitype cell detection and classification in BM examination and mitotic figure examination. The proposed weakly supervised learner is demonstrated to be useful for applications with partially annotated data and for boosting up the model performance in both object detection and classification. The rest of this article is organized as follows. Related works on weakly supervised learning, (semi)-automatic BM analysis and (semi)-automatic mitotic figure examination are described in Supplementary Section S1. Figure 1 presents sample cells of various cell types in the two datasets used in this study; see Supplementary Section S2. Section 2 describes the proposed method. The experimental results in comparison with the benchmark methods are given in Section 3. Section 4 concludes the article.

Proposed CW-Net
When the training data are partially labeled, many unlabeled objects of interest are wrongly defined as background or contents of no interest, which severely confuses AI learners during supervised learning and degrades the performance of the resulting AI models. The proposed deep learning approach is devised with (i) a Jaccard-based soft sampling weighted loss function to achieve a reasonable balance between hard examples and the rest of the background in instance sampling, (ii) a dual layer filtered negative instance sampling (DL-FNIS) strategy to generate better detectors and classifiers, (iii) a multiclass nonmaximum-suppression (MCNMS) strategy to ensure no contradictory prediction of a cell and (iv) a data augmentation and normalization strategy to minimize generalization errors and prevent overfitting.
In routine BM examination, examiners first determine an adequate BM smear by the presence and cellularity of particles viewed under low magnification power to avoid diluted regions. Second, examiners perform BM NDC analysis within areas with well spread marrow cells in the cellular trails of the BM smear behind the particles viewed under high magnification power. In this study, we introduce an efficient and fully automatic deep learning method (CW-Net) for multitype cell differentiation of BM NDC WSI analysis in seconds. An overview of the proposed method is shown in Fig. 2. The first layer CNN model rapidly locates BM particles and cellular trails in low resolution as ROI(s), which is used to locate data for further analysis in high magnification level where the second layer CW-Net performs BM cell detection and classification. The base deep learning model of the proposed method is adapted from Cascade R-CNN (Cai and Vasconcelos 2021), which is a multistage object detection and classification method.

Proposed CW-Net architecture
The proposed CW-Net is composed of four components: a backbone, a feature pyramid network (FPN+; Cai and Vasconcelos 2021), a region proposal network (RPN) and a prediction head. Initially, a deep ResNet101 backbone is employed to extract features, and the output of ResNet101 is then sent into FPN+, which integrates features from different levels and creates multiple scale features by up-sampling. The multiscale feature maps created from FPN+ are fed into RPN, which generates proposal bounding boxes and assigns anchors to feature maps of varying sizes. We devised RPN with a Jaccard-based soft sampling weighted loss function, which achieved state-of-the-art performance on partially annotated data in our previous study (Wang et al. 2022).
In training, soft sampling ensures that all positives and hard negatives contribute to the gradient, but with a lower attention weight. An illustration of the soft-sampling weighting mechanism is given in Fig. 2b3.
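The weighting idea can be illustrated with a minimal sketch. The exact weighting function is not given in this section, so the Gompertz-shaped curve below, the function names and the parameter values (`floor`, `b`, `c`) are assumptions chosen for illustration only: a candidate box far from every annotated cell still contributes to the loss, but with a reduced weight, whereas a box that overlaps an annotation strongly receives a weight close to 1.

```python
import math

def jaccard(box_a, box_b):
    """Jaccard index (IoU) of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def soft_sampling_weight(candidate, references, floor=0.25, b=50.0, c=20.0):
    """Hypothetical loss weight in [floor, 1] for one candidate box.

    The weight grows with the best Jaccard overlap between the candidate
    and any annotated reference box (assumed Gompertz-shaped curve).
    """
    best = max((jaccard(candidate, r) for r in references), default=0.0)
    return floor + (1.0 - floor) * math.exp(-b * math.exp(-c * best))

references = [(0, 0, 10, 10)]                    # one annotated cell
w_far = soft_sampling_weight((100, 100, 110, 110), references)
w_half = soft_sampling_weight((0, 0, 10, 20), references)
w_exact = soft_sampling_weight((0, 0, 10, 10), references)
```

With these illustrative parameters, a box with no overlap keeps only the floor weight, so possibly unlabeled cells far from any annotation are down-weighted rather than treated as fully trusted background.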

Dual layer filtered negative instance sampling
As shown in Fig. 3a, single layer binary classification systems are used in Cascade R-CNN to generate positive samples $b^{d+}$ and negative samples $b^{d-}$ from candidate bounding boxes $b^{d}$ to train multiple classifiers with thresholds $u = \{u_z\}_{z=1 \ldots Z}$,
where $u_1 = 0.5$; $q = 0.1$ is the increasing confidence factor; $Z$ is the number of classifiers, and $Z = 3$ in this study. However, for partially labeled data, the single layer design causes serious confusion in AI learning and degrades the resulting detectors and classifiers because a number of unlabeled cells will be used as negative samples for training. To resolve this issue caused by partial annotations, we propose a dual layer filtered negative instance sampling (DL-FNIS) strategy, which adds an extra layer in the selection of positive and negative samples at the last stage model. The extra layer introduces a new sample type, the ignored class $b^{di}_Z$, based on the Jaccard index between the reference standards and the candidate bounding boxes, as shown in Fig. 3a, producing positive samples $b^{d+}_Z$ and refined negative samples $(b^{d-}_Z)^{*}$ for training the $Z$th classifier and detector. The second layer binary classification system of the proposed DL-FNIS is devised to select refined negative samples for training and is formulated as follows,
where $u_i = 0.1$ is used to define ignored instances for training in this study.
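The sampling rule can be sketched as follows. The stage thresholds follow from the stated values ($u_1 = 0.5$, $q = 0.1$, $Z = 3$) under the assumption that each stage raises the threshold by $q$, as in the standard Cascade R-CNN setting of 0.5, 0.6 and 0.7. The three-way split is a hypothetical reading of the text: the formulas are not reproduced in this section, so the direction of the ignore rule (boxes whose best Jaccard index falls below $u_i$ are ignored, because they may cover unlabeled cells) is an assumption, as are the function names.

```python
def cascade_thresholds(u1=0.5, q=0.1, stages=3):
    """Per-stage Jaccard thresholds, assuming u_z = u_1 + (z - 1) * q."""
    return [round(u1 + z * q, 10) for z in range(stages)]

def assign_sample_type(best_jaccard, u_pos, u_ignore=0.1):
    """Hypothetical last-stage labeling with an extra 'ignored' class.

    best_jaccard: highest Jaccard index between a candidate box and any
    reference (annotated) box.
    """
    if best_jaccard >= u_pos:
        return "positive"
    if best_jaccard >= u_ignore:
        return "negative"   # refined negative: near an annotation, but a poor fit
    return "ignored"        # assumed: may be an unlabeled cell, excluded from training

thresholds = cascade_thresholds()
labels = [assign_sample_type(j, u_pos=thresholds[-1]) for j in (0.75, 0.30, 0.0)]
```

Under this reading, only candidates located close to an annotated cell can become negatives, which keeps unannotated cells elsewhere on the slide out of the negative set.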
To further illustrate the contributions of DL-FNIS, Fig. 3b-d compares the outputs of the original Cascade R-CNN, the proposed CW-Net without DL-FNIS and the proposed CW-Net with DL-FNIS, using a low threshold for the detection-classification confidence of a cell. Even with such a low threshold, (b) the original Cascade R-CNN still leaves many cells undetected. (c) The proposed CW-Net without DL-FNIS detects many cells, but with very low classification confidence, so these detections tend to be filtered out by common criteria such as 0.5. (d) With the proposed CW-Net with DL-FNIS, more cells are detected than in (b) and (c), and the confidence rates of cells are significantly higher. The result shows the effectiveness of DL-FNIS for improving both detectors and classifiers.

Multiclass nonmaximum suppression
[Figure 2 caption, beginning lost in extraction: ... ROI(s), which is then mapped to high magnification level where the second layer network performs BM cell detection and classification inside ROI(s). (b and b2) The region proposal network (RPN) of CW-Net produces candidate bounding boxes using (b3) the proposed soft-sampling weighted loss function to decrease the influence of unlabeled data in AI training and avoid confusions among unlabeled data, targets and background. (b4) Sample classification models are then applied to generate (b5) positive $b^{d+}$ and refined negative samples $(b^{d-})^{*}$ using (b6) the proposed DL-FNIS to produce better detectors and classifiers in training. In inference, (c) the original SCNMS is replaced with (d) the proposed MCNMS to ensure that there is no contradictory classification.]
In the original Cascade R-CNN design, it was found that a single cell object may be misidentified as various kinds of cells because single class nonmaximum suppression (SCNMS) is utilized for inference, as shown in Fig. 2c. For each class, the initial output set $o_c$ is rendered if the probability of a detected object is greater than the classification threshold,
where $p(c_{b^{d}}, b^{d})$ is the detected BM cell type probability, and the classification threshold is 0.5 in this study. SCNMS aims to suppress the initial detection results $o_c$ at each class to generate the SCNMS output $O_{scnms}$,
where the Jaccard index is computed between the detected BM cells $(b_i, b_j)$, and $g = 0.3$ in this study.
In inference, SCNMS is replaced with an MCNMS strategy, which ensures no contradictory prediction of a cell. This strategy greedily selects a subset of detection bounding boxes by pruning away boxes that have high Jaccard overlap with already selected boxes.
MCNMS produces the output set $O_{MCNMS}$ for all classes at once, formulated as follows,
where the Jaccard index is computed between the detected BM cells $(b_i, b_j)$, and $g = 0.3$ in this study.
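The difference between the two suppression schemes can be shown with a small sketch. It assumes a plain greedy procedure, a score threshold of 0.5 and a Jaccard overlap threshold of 0.3 (reading $g = 0.3$ as the suppression threshold); the function names and the example detections are hypothetical and not taken from the authors' implementation.

```python
from collections import defaultdict

def jaccard(box_a, box_b):
    """Jaccard index (IoU) of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(detections, overlap_threshold):
    """Keep boxes in descending score order, dropping overlapping ones."""
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(jaccard(det[0], k[0]) <= overlap_threshold for k in kept):
            kept.append(det)
    return kept

def single_class_nms(detections, score_threshold=0.5, overlap_threshold=0.3):
    """SCNMS: suppression runs separately inside each class."""
    per_class = defaultdict(list)
    for det in detections:
        if det[2] > score_threshold:
            per_class[det[1]].append(det)
    output = []
    for class_detections in per_class.values():
        output.extend(greedy_nms(class_detections, overlap_threshold))
    return output

def multiclass_nms(detections, score_threshold=0.5, overlap_threshold=0.3):
    """MCNMS: suppression runs once over all classes together."""
    confident = [d for d in detections if d[2] > score_threshold]
    return greedy_nms(confident, overlap_threshold)

# Each detection is (box, class name, probability); values are made up.
detections = [
    ((10, 10, 50, 50), "blast", 0.90),
    ((12, 11, 51, 49), "monocyte", 0.60),   # same cell, conflicting class
    ((200, 200, 240, 240), "myelocyte", 0.80),
    ((300, 300, 340, 340), "blast", 0.30),  # below the score threshold
]
scnms_output = single_class_nms(detections)
mcnms_output = multiclass_nms(detections)
```

In this example the per-class scheme keeps both the "blast" and the "monocyte" box for the same cell, while the all-class scheme keeps only the higher-scoring "blast" box.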

Data augmentation and normalization
Methods trained with images from one hospital tend to perform poorly on images from other hospitals, even for state-of-the-art deep learning-based methods. Minimizing generalization error is important for building a robust AI model for unseen data. Data augmentation can operate as a regularizer in neural networks, minimizing overfitting and improving performance when dealing with unbalanced classes. During training, data augmentation mimics a broad range of actual changes, making CNNs robust to variations in stain, translation, perspective, size or lighting. Data normalization, on the other hand, is designed to reduce data variation and thereby improve model generalizability. In this work, we built a Jaccard-based data augmentation method and a data normalization process for reducing the model generalization error. First, the data augmentation is applied to a selected patch if its associated Jaccard coefficient $g_{z_k}$, which is determined in Equation (9), is >0. The selected patches are used to supplement the training set with new synthetically changed data during the training process, using the following operations: rotation (in 5° steps, five times, and in 90° increments), mirror flipping along the horizontal and vertical axes, contrast adjustment (random contrast, range 0% ± 20%), saturation adjustment (random saturation, range 0% ± 20%) and brightness adjustment (random brightness, range 0% ± 12.5%),
where $h$ is the reference positive samples, and $z_k$ is a patch. Second, a data normalization method is built to maintain data consistency and gradient stability in the training dataset, as well as to prevent overflow from data augmentation. The data normalization is performed by histogram specification, which matches the color distribution $P_S$ of input data $S$ to a reference distribution $P_R$ trained from the ImageNet dataset (Deng et al. 2009) and produces normalized data $S'$.
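Histogram specification can be sketched for a single 8-bit channel as below. This is a generic textbook version under stated assumptions, not the authors' code: the function names are hypothetical, the reference distribution is built from a toy list of pixel values rather than from ImageNet, and a color image would be processed one channel at a time.

```python
def cdf_from_pixels(pixels, levels=256):
    """Cumulative distribution of integer intensity levels in `pixels`."""
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    cdf, running = [], 0
    for count in histogram:
        running += count
        cdf.append(running / total)
    return cdf

def histogram_specification(source, reference_cdf, levels=256):
    """Map each source level to the lowest reference level whose
    cumulative probability reaches the source level's own."""
    source_cdf = cdf_from_pixels(source, levels)
    mapping, ref_level = [], 0
    for level in range(levels):
        # Both CDFs are non-decreasing, so the pointer only moves forward.
        while ref_level < levels - 1 and reference_cdf[ref_level] < source_cdf[level]:
            ref_level += 1
        mapping.append(ref_level)
    return [mapping[value] for value in source]

# Toy data: a dark source channel matched to a brighter reference.
reference_cdf = cdf_from_pixels([100, 100, 200, 200])
normalized = histogram_specification([0, 0, 1, 1], reference_cdf)
```

Because the mapping depends only on the two cumulative distributions, the same reference can be reused for every training and test patch, which is what makes the step useful for reducing stain and scanner variation.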

Adaptive learning
Cascade R-CNN uses a fixed step size for reducing the learning rate by 10% at 160k and 240k iterations. However, the training instances contribute less to the learned model as the learning rate decreases, and some instances may not be learned adequately if the learning rate reduces dramatically.
To address this problem, we proposed an adaptive learning (AL) technique with a flexible data-oriented learning rate adjustment mechanism. Given the number of training images $I$, the size of each image $w \times h$ and the size of each unit patch $q \times q$, the AL rate $r_K$ at iteration $K$ is formulated as follows,
where $I \times \frac{w}{q} \times \frac{h}{q}$ represents the total number of patches in the training data, excluding data augmentation, and $a = 10\%$.

Identification of 16 types of bone marrow cells
Quantitative evaluation was performed with patch-wise 10-fold cross validation, in which cells on the same patch were assigned exclusively to one fold and never to both the training and testing sets at the same time. The proposed CW-Net was compared with three recently published benchmark approaches, including two small-image-based approaches (Yu et al. 2019, Chandradevan et al. 2020) and our previous work for BM NDC WSI analysis (Wang et al. 2022). Table 1 gives the results, with the best results highlighted in bold; the numbers for Yu et al. (2019), Chandradevan et al. (2020) and Wang et al. (2022) are those reported in the respective publications. The experimental results show that the proposed CW-Net demonstrates superior performance in BM NDC analysis in WSIs with an averaged recall of 0.974 ± 0.032, an averaged accuracy of 0.997 ± 0.003 and an averaged PR-AUC of 0.985 ± 0.027. Moreover, the proposed method consistently achieves >0.99 accuracy for all BM cell types and >0.95 recall for most types. In comparison with two recent BM NDC analysis methods (Yu et al. 2019, Chandradevan et al. 2020), which require human intervention to manually crop small image areas, the proposed fully automatic CW-Net method outperforms both benchmark approaches for identification of all cell types. Moreover, in comparison with Cascade R-CNN, the proposed CW-Net consistently obtains higher recalls and accuracies. Importantly, the proposed CW-Net achieves high recall values in identification of many cell types such as blast, promyelocyte, myelocyte and monocyte, for which Cascade R-CNN obtains comparatively low recall and which CW-Net improves by 10%-37%. Furthermore, based on the confusion matrix provided in Fig. 4, the proposed method shows excellent performance for 16 of the 17 BM cells. To sum up, the proposed CW-Net greatly improves the AI model performance, and the proposed framework could also be integrated into other CNN methods to improve model performance.
Figure 5 presents sample results by the proposed method, and a discussion on the system limitation is provided in Supplementary Section S3.
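The patch-wise splitting used in the evaluation above can be sketched as follows. The round-robin assignment of patches to folds and the function name are assumptions for illustration; the property that matters is that all cells from one patch land in the same fold, so no patch contributes to both training and testing.

```python
from collections import defaultdict

def patchwise_folds(cell_patch_ids, n_folds=10):
    """Group cell indices into folds so that a patch never spans two folds.

    cell_patch_ids: one patch identifier per annotated cell.
    Returns a list of n_folds lists of cell indices.
    """
    patches = sorted(set(cell_patch_ids))
    # Assumed assignment rule: patches are dealt to folds in turn.
    fold_of_patch = {patch: i % n_folds for i, patch in enumerate(patches)}
    folds = defaultdict(list)
    for cell_index, patch in enumerate(cell_patch_ids):
        folds[fold_of_patch[patch]].append(cell_index)
    return [folds[f] for f in range(n_folds)]

# Toy example: 20 patches with 3 annotated cells each.
cell_patch_ids = ["patch%d" % (i // 3) for i in range(60)]
folds = patchwise_folds(cell_patch_ids, n_folds=10)
```

Each fold then serves once as the test set while the cells of the remaining nine folds are used for training.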

Intra- and interobserver reliability analysis
Cohen's kappa statistic is used to analyze the intra- and interobserver annotation agreement. The intraobserver analysis is performed based on two sets of annotations produced at an interval of one week. For the intraobserver variability, we perform the kappa analysis on 665 randomly selected BM cells from three WSIs, and for the interobserver variability, we perform the kappa analysis on 1966 randomly selected BM cells from five WSIs (see Supplementary Table S1). Conventionally, kappa values of <0.20, 0.21-0.40, 0.41-0.60, 0.61-0.80 and 0.81-1.00 are interpreted as poor, fair, moderate, good and excellent agreement, respectively (Gianelli et al. 2014). The intraobserver reliabilities of examiners 1 and 2 are interpreted as good with kappa values of 0.608 and 0.789, respectively, and the interobserver reliability between the two examiners is also good with a kappa value of 0.8. For the interobserver analysis between AI and examiners, high kappa values of 0.824 and 0.908 were obtained, showing that the proposed AI model is reliable and highly consistent with the specialized medical examiners' decisions. Moreover, the results show that the second examiner, who has >20 years of expertise in BM NDC analysis, produces more consistent decisions, obtaining a higher intraobserver kappa than the first examiner. In addition, the results of the interobserver analysis show that the mean kappa between the proposed AI model and the senior examiner is higher than that with the junior examiner. More information is provided in Supplementary Section S4.
Table 2 shows that the proposed CW-Net outperforms the benchmark methods in identification of mitotic figures, with a recall of 0.843, precision of 0.858 and f1-score of 0.851 in detection, and a precision of 0.841 and recall of 0.876 in classification of mitotic figures. Furthermore, the proposed CW-Net without FNIS achieves the second-best precision of 0.762 and f1-score of 0.761 in detection, and the highest recall of 0.943 and the second-best f1-score of 0.862 in classification of mitotic figures. Table 3
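For reference, the kappa statistic used in the reliability analysis can be computed as in the sketch below. It implements the standard unweighted Cohen's kappa for two raters; the function names and the toy labels are hypothetical, and the handling of values that fall between the quoted bands (for example between 0.60 and 0.61) is an assumption.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

def interpret_kappa(kappa):
    """Agreement bands as quoted in the text; band edges are assumed."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "excellent"

# Toy annotations of four cells by two raters.
rater_1 = ["blast", "blast", "monocyte", "monocyte"]
rater_2 = ["blast", "blast", "monocyte", "blast"]
kappa = cohens_kappa(rater_1, rater_2)
```

Here the raters agree on three of four cells (observed agreement 0.75) while chance agreement is 0.5, so kappa corrects the raw agreement downward.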

Conclusion
Fully automated examinations of BM slides and mitotic figures are highly demanded but challenging. First, the complexity and poor reproducibility of BM and mitotic figure examination on WSIs arise from the cell type diversity, delicate intralineage difference within the maturation process of multitype cells, cell overlapping, lipid interference and stain variations. Second, manual annotation on WSIs with enormous data dimensions and complicated cell types is difficult, which restricts the supervised information to a limited set of easily identifiable and scattered cells annotated by humans. In this article, we develop a fully automatic and efficient cascaded weakly supervised deep learning framework (CW-Net) for multitype cell detection and classification for both BM examination and mitotic figure examination. Comprehensive experiments demonstrate that the proposed method has discriminative ability in both applications and achieves state-of-the-art performance.
[Figure caption fragment: (1) A global view of the WSI with detection results, (2) a global view of the WSI with the location of (3) the medium zoom-in view and (4) a high-resolution zoom-in view.]

Supplementary data
Supplementary data are available at Bioinformatics online.

Conflict of interest
None declared.

Data availability
The data underlying this article cannot be shared publicly due to the requirement of the ethical approval. The data will be shared on reasonable request to the corresponding author.